EDA for S&P 500 Index
Exploratory Data Analysis (EDA) for the S&P 500 Index involves analyzing time series data to identify the underlying patterns and characteristics of the index. The S&P 500 is a stock market index that measures the performance of 500 large companies listed on the US stock exchanges. EDA for the S&P 500 typically involves analyzing the daily closing prices of the index and examining key aspects such as autocorrelation, seasonality, trend, and stationarity. This information can be used to identify potential patterns and trends in the data, inform our modeling approach, and potentially improve our investment strategies.
Time Series Plot
Code
# get data
options("getSymbols.warning4.0"=FALSE)
options("getSymbols.yahoo.warning"=FALSE)
data = getSymbols("^GSPC",src='yahoo',from = '2010-01-01',to = "2023-03-01")
df <- data.frame(Date=index(GSPC),coredata(GSPC))
# create Bollinger Bands
bbands <- BBands(GSPC[,c("GSPC.High","GSPC.Low","GSPC.Close")])
# join and subset data
df <- subset(cbind(df, data.frame(bbands[,1:3])), Date >= "2010-01-01")
#export data
sp_raw_data <- df
write.csv(sp_raw_data, "DATA/CLEANED DATA/sp500_raw_data.csv", row.names=FALSE)
# colors column for increasing and decreasing
for (i in 1:length(df[,1])) {
if (df$GSPC.Close[i] >= df$GSPC.Open[i]) {
df$direction[i] = 'Increasing'
} else {
df$direction[i] = 'Decreasing'
}
}
i <- list(line = list(color = '#EBD168'))
d <- list(line = list(color = '#7F7F7F'))
# plot candlestick chart
fig <- df %>% plot_ly(x = ~Date, type="candlestick",
open = ~GSPC.Open, close = ~GSPC.Close,
high = ~GSPC.High, low = ~GSPC.Low, name = "GSPC",
increasing = i, decreasing = d)
fig <- fig %>% add_lines(x = ~Date, y = ~up , name = "B Bands",
line = list(color = '#ccc', width = 0.5),
legendgroup = "Bollinger Bands",
hoverinfo = "none", inherit = F)
fig <- fig %>% add_lines(x = ~Date, y = ~dn, name = "B Bands",
line = list(color = '#ccc', width = 0.5),
legendgroup = "Bollinger Bands", inherit = F,
showlegend = FALSE, hoverinfo = "none")
fig <- fig %>% add_lines(x = ~Date, y = ~mavg, name = "Mv Avg",
line = list(color = '#E377C2', width = 0.5),
hoverinfo = "none", inherit = F)
fig <- fig %>% layout(yaxis = list(title = "Price"))
# plot volume bar chart
fig2 <- df
fig2 <- fig2 %>% plot_ly(x=~Date, y=~GSPC.Volume, type='bar', name = "GSPC Volume",
color = ~direction, colors = c('#EBD168','#7F7F7F'))
fig2 <- fig2 %>% layout(yaxis = list(title = "Volume"))
# create rangeselector buttons
rs <- list(visible = TRUE, x = 0.5, y = -0.055,
xanchor = 'center', yref = 'paper',
font = list(size = 9),
buttons = list(
list(count=1,
label='RESET',
step='all'),
list(count=1,
label='1 YR',
step='year',
stepmode='backward'),
list(count=3,
label='3 MO',
step='month',
stepmode='backward'),
list(count=1,
label='1 MO',
step='month',
stepmode='backward')
))
# subplot with shared x axis
fig <- subplot(fig, fig2, heights = c(0.7,0.2), nrows=2,
shareX = TRUE, titleY = TRUE)
fig <- fig %>% layout(title = paste("S&P 500 Index Stock Price: January 2010 - March 2023"),
xaxis = list(rangeselector = rs),
legend = list(orientation = 'h', x = 0.5, y = 1,
xanchor = 'center', yref = 'paper',
font = list(size = 10),
bgcolor = 'transparent'))
figOver the period from January 2010 to March 2023, the S&P 500 index exhibited both upward and downward trends in its stock price. The index started off on a positive note, rising steadily from early 2010 to early 2011. However, it then experienced a significant decline in value, dropping by over 15% by October 2011. From there, the index began a slow but steady climb, reaching new all-time highs by mid-2015.
The S&P 500 continued to rise throughout 2016 and into 2017, with a few minor dips along the way. However, it then experienced a sharp drop in value in early 2018, losing over 10% of its value in just a few days. The index then regained some of its losses but continued to fluctuate throughout the remainder of 2018.
In 2019, the S&P 500 once again resumed its upward trend, with a few minor dips along the way. However, the COVID-19 pandemic in 2020 caused a significant drop in the index’s value, with the index losing nearly 34% of its value in just a few weeks. However, the index quickly rebounded, aided by government stimulus measures and low-interest rates, and continued its upward trend throughout 2020 and 2021.
Overall, the S&P 500 index’s stock price exhibited significant fluctuations over the period from January 2010 to March 2023. However, despite the fluctuations, the index exhibited an overall upward trend, reaching new all-time highs multiple times over the period.
For stock prices, a multiplicative decomposition is typically preferred because the percentage changes in stock prices tend to be more important than the absolute changes. Additionally, stock prices tend to exhibit non-constant variance, meaning that the variance of the series changes over time. A multiplicative decomposition can handle this non-constant variance more effectively than an additive decomposition.
Decomposed Time Series
Code
#convert data to ts data
myts<-ts(df$GSPC.Adjusted,frequency=252,start=c(2010,1,1))
orginial_plot <- autoplot(myts,xlab ="Year", ylab = "Adjusted Closing Price", main = "S&P 500 Index Stock price: Jan 2010 - March 2023")
decompose = decompose(myts, "multiplicative")
autoplot(decompose)Code
#adjusted decomposition
trendadj <- myts/decompose$trend
decompose_adjtrend_plot <- autoplot(trendadj,ylab='trend') +ggtitle('Adjusted trend component in the multiplicative time series model')
seasonaladj <- myts/decompose$seasonal
decompose_adjseasonal_plot <- autoplot(seasonaladj,ylab='seasonal') +ggtitle('Adjusted seasonal component in the multiplicative time series model')
grid.arrange(orginial_plot, decompose_adjtrend_plot,decompose_adjseasonal_plot, nrow=3)When compared to the original figure, the corrected seasonal component has an upward trend and greater variability in the model, but the adjusted trend component has a stable trend through time.
Lag Plots
Code
#Lag plots
gglagplot(myts, do.lines=FALSE, lags=1)+xlab("Lag 1")+ylab("Yi")+ggtitle("Lag Plot for S&P 500 Index Stock Jan 2010 - March 2023")Code
#montly data
mean_data <- df %>%
mutate(month = month(Date), year = year(Date)) %>%
group_by(year, month) %>%
summarize(mean_value = mean(GSPC.Adjusted))
#ts for month data
month<-ts(mean_data$mean_value,star=decimal_date(as.Date("2010-01-01",format = "%Y-%m-%d")),frequency = 12)
#Lag plot for month
ts_lags(month)S&P 500 Index Stock Lag Plot As there is a positive correlation and an inclination angle of 45 degrees, there should be a significant link between the series and the relevant lag from January 2010 to March 2023. This is the lag plot hallmark of a process that has a high degree of positive autocorrelation. Such processes are highly non-random, there is a substantial relationship between one observation and the next. Seasonality can also be investigated by plotting observations for a wider number of time periods, i.e. the lags. The time series data is aggregated to monthly data using the mean function for a better comprehension of the series and crisper plots. Further inspection of the last graph reveals that more dots are on the diagonal line at 45 degrees. The second graph shows the monthly variation of the variable on the vertical axis. In chronological order, the lines connect the points. This suggest that there is strong association between an observation and a succeeding observation.
Seasonality
Code
# Create seasonal plot
ts_heatmap(month, color = "YlOrBr", title = 'Seasonality Heatmap of S&P 500 Index Stock Jan 2010 - March 2023')Code
# Create a line graph for each year with months on the x-axis
ggseasonplot(month, datecol = "date", valuecol = "value")+ggtitle("Seasonal Yearly Plot for S&P 500 Index Stock Jan 2010 - March 2023")The Seasonality Heatmap for the S&P 500 Index stock from January 2010 to March 2023 shows some evidence of seasonality in the data, although the patterns are not consistent across all years. The heatmap displays the mean value of the time series for each month and year combination, with darker colors indicating higher values. The heatmap reveals that the S&P 500 Index tends to exhibit higher values during the months of December and January in many of the years studied. However, this pattern is not consistently observed across all years. Similarly, the yearly line graph also shows some evidence of seasonality, with a general trend of higher values during the latter part of the year. However, this trend is not present in all years, and the magnitude of the seasonal effect varies between years. Overall, the presence of some evidence of seasonality in the S&P 500 Index suggests that seasonality may be one of the factors contributing to the fluctuations in the stock price. However, other factors beyond seasonality, such as economic and political events, also play a significant role in determining the stock price.
Moving Average
Code
#SMA Smoothing - 4 month
ma <- autoplot(month, series="Data") +
autolayer(ma(month,5), series="4 month MA") +
xlab("Year") + ylab("GWh") +
ggtitle("S&P 500 Index Stock Jan 2010 - March 2023 (4 Month Moving Average") +
scale_colour_manual(values=c("Data"="grey50","4 month MA"="red"),
breaks=c("Data","4 month MA"))
maCode
#SMA Smoothing - 1 Year
ma <- autoplot(month, series="Data") +
autolayer(ma(month,13), series="1 Year MA") +
xlab("Year") + ylab("GWh") +
ggtitle("S&P 500 Index Stock Jan 2010 - March 2023 (1 Year Moving Average") +
scale_colour_manual(values=c("Data"="grey50","1 Year MA"="red"),
breaks=c("Data","1 Year MA"))
maCode
#SMA Smoothing - 3 Year
ma <- autoplot(month, series="Data") +
autolayer(ma(month,37), series="3 Year MA") +
xlab("Year") + ylab("GWh") +
ggtitle("S&P 500 Index Stock Jan 2010 - March 2023 (3 Year Moving Average)") +
scale_colour_manual(values=c("Data"="grey50","3 Year MA"="red"),
breaks=c("Data","3 Year MA"))
maCode
#SMA Smoothing - 5 Year
ma <- autoplot(month, series="Data") +
autolayer(ma(month,61), series="5 Year MA") +
xlab("Year") + ylab("GWh") +
ggtitle("S&P 500 Index Stock Jan 2010 - March 2023 (5 Year Moving Average)") +
scale_colour_manual(values=c("Data"="grey50","5 Year MA"="red"),
breaks=c("Data","5 Year MA"))
maThe four plots above show the S&P 500 Index stock prices for the period between Jan 2010 and March 2023, smoothed using 4-month, 1-year, 3-year and 5-year moving averages.
Looking at the plots, we can see that the moving average values are increasing over time. The 4-month MA plot shows a lot of fluctuations, which is expected because it captures the short-term variations in the stock prices. On the other hand, the 1-year and 3-year and 5-year MA plots smooth out the fluctuations and show the overall trend of the stock prices.
We can observe that the 5-year MA plot provides a smoother trend as it takes into account a longer time period compared to the other plots. The 1-year MA plot also provides a relatively smooth trend, but it captures shorter-term variations compared to the 5-year MA plot. The 4-month MA plot is even more sensitive to shorter-term variations in the stock prices. From the moving average obtained above we can see that there is upward tend in the stock price of S&P 500 Index.
Autocorrelation Time Series
Code
#ACF plots for month data
ggAcf(month)+ggtitle("ACF Plot for S&P 500 Index Stock Jan 2010 - March 2023")Code
#PACF plots for month data
ggPacf(month)+ggtitle("PACF Plot for S&P 500 Index Stock Jan 2010 - March 2023")Code
# ADF Test
tseries::adf.test(month)
Augmented Dickey-Fuller Test
data: month
Dickey-Fuller = -2.5431, Lag order = 5, p-value = 0.3499
alternative hypothesis: stationary
There is clear autocorrelation in lag in the plot of autocorrelation function, which is the acf graph for monthly data. The lag plots and autocorrelation plots shown above suggest seasonality in the series, indicating that it is not stationary. It was also validated using the Augmented Dickey-Fuller Test, which indicates that the series is not stationary as the p value is more than 0.05.
Detrend and Differenced Time Series
Code
fit = lm(myts~time(myts), na.action=NULL)
summary(fit)
Call:
lm(formula = myts ~ time(myts), na.action = NULL)
Residuals:
Min 1Q Median 3Q Max
-1100.29 -200.78 -24.77 78.55 1008.30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.064e+05 2.628e+03 -192.7 <2e-16 ***
time(myts) 2.523e+02 1.303e+00 193.6 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.5 on 3309 degrees of freedom
Multiple R-squared: 0.9189, Adjusted R-squared: 0.9188
F-statistic: 3.748e+04 on 1 and 3309 DF, p-value: < 2.2e-16
Code
# plot ACFs
plot1 <- ggAcf(myts, 48, main="Original Data: S&P 500 Index Stock Price")
plot2 <- ggAcf(resid(fit), 48, main="detrended data")
plot3 <- ggAcf(diff(myts), 48, main="first differenced data")
grid.arrange(plot1, plot2, plot3,nrow=3)The estimated slope coefficient β1, 3.655e+02. With a standard error of 1.888e+00, yielding a significant estimated increase of stock price is very less yearly. Equation of the fit for stationary process: \[\hat{y}_{t} = x_{t}+(7.338e+05)-(3.655e+02)t\]
From the above graph we can say that there is high correlation in the original plot, but in the detrended plot the correlation is reduced but there is still high correlation in the detrended data.But when the first order difference is applied the high correlation is removed but there is seasonal correlation.
As depicted in the above figure, the series is now stationary and ready for future study.